In its essence, a program is a sequence of instructions that specifies how to perform a computation.
Computations can be of different types:
Every programming language has its own quirks but in general, every program (even the more complex ones) is a combination of a few basic instructions:
The key building blocks of most languages are objects. In R, objects may be variables, arrays of numbers, character strings, functions, or more general structures built from such components. All R statements where you create objects, assignment statements, have the same form:
Where <- is the assignment operator. This statement
tells R that an object called object_name should be created
and assigned the value value. A specific example could
be:
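As a minimal sketch of the form `object_name <- value` (the object name `greeting` is hypothetical, chosen only for illustration):

```r
# Create an object called greeting and assign it a character string
greeting <- "Hello World"

# Typing the name of an object prints its value
greeting
```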
It is a good practice to use descriptive names and a consistent style. People have their own taste. Some commonly used alternatives are:
iMostlyUseCamelCase
other_people_use_snake_case
some.people.use.periods
And_aFew.People_RENOUNCEconvention # Not recommended
You can inspect an object by typing its name:
## [1] "Hello World"
Exercise
Try to create a new object called myFirstObject and
assign it the value 5.4:
Then inspect it and check that everything worked as expected:
Solution
## [1] 5.4
Objects in R can be of different types (modes). The basic types of objects are:
numeric: integer (no decimals) or double (with decimals)
logical (TRUE or FALSE values)
character (strings)
You can inspect an object’s type by typing:
## [1] "double"
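As a quick sketch, `typeof()` reports the type of a few literal values (the values themselves are arbitrary examples):

```r
typeof(5.4)      # "double"
typeof(5L)       # "integer" (the L suffix marks an integer literal)
typeof(TRUE)     # "logical"
typeof("hello")  # "character"
```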
A series of objects of the same type can be grouped into a
vector. Vectors in R are defined as follows:
If you try to create a vector with objects of different types, R will try to coerce them to a common type:
## [1] "1" "3" "5" "FALSE"
In that case you’ll need to use a list:
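A minimal sketch of these three cases, using hypothetical object names: a vector built with `c()`, the coercion that happens when types are mixed, and a list that keeps each element's type.

```r
# A vector groups objects of the same type
oddNumbers <- c(1, 3, 5)

# Mixing types forces coercion to a common type (here character)
mixedVector <- c(1, 3, "5", FALSE)
mixedVector                 # "1" "3" "5" "FALSE"

# A list keeps every element's original type
mixedList <- list(1, 3, "5", FALSE)
sapply(mixedList, typeof)   # "double" "double" "character" "logical"
```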
Objects can be easily converted from one type to another (when it makes sense to do so). Here is an example:
## [1] 0 1 2 3 4 5 6 7 8 9
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
A further coercion reconstructs the numerical vector:
## [1] 0 1 2 3 4 5 6 7 8 9
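The round trip above can be sketched with the `as.character()` and `as.numeric()` coercion functions; the object names are hypothetical.

```r
digits <- 0:9                             # integer vector 0 1 ... 9
digitsAsText <- as.character(digits)      # "0" "1" ... "9"
digitsAgain <- as.numeric(digitsAsText)   # numbers again: 0 1 ... 9

# When a conversion does not make sense, R returns NA and issues a warning
notANumber <- suppressWarnings(as.numeric("abc"))
notANumber                                # NA
```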
A special and very useful type of object is a function. As a
general concept, a function takes some objects as inputs or parameters
and uses them to complete an action or produce another object. For
example, the seq() function takes three main arguments and
returns the sequence of values from from to to
in steps of length by.
## [1] 1 2 3 4 5
This works because seq belongs to the core of the R
language.
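As a short sketch, here are two calls to `seq()` with its `from`, `to` and `by` arguments spelled out (the second set of values is an arbitrary illustration):

```r
seq(from = 1, to = 5, by = 1)      # 1 2 3 4 5
seq(from = 0, to = 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00
```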
You can define your own functions in the same way you create an object:
For example, we can create a function that takes a vector of numbers and adds 2 to each of its elements:
add2 <- function(numbers) {
  return(numbers + 2) # The return statement tells R to output the result
}
add2(1:4)
## [1] 3 4 5 6
In our last example, you’ve seen how R can perform basic mathematical operations. Mathematical operators mostly work as expected:
In addition to mathematical operators, R also has logical operators:
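A brief sketch of the most common operators in base R, applied to arbitrary numbers:

```r
# Mathematical operators
7 + 2     # addition: 9
7 - 2     # subtraction: 5
7 * 2     # multiplication: 14
7 / 2     # division: 3.5
7 ^ 2     # exponentiation: 49
7 %% 2    # remainder of the division: 1
7 %/% 2   # integer division: 3

# Comparison and logical operators
7 > 2               # greater than: TRUE
7 == 2              # equal to: FALSE
7 != 2              # different from: TRUE
(7 > 2) & (7 < 5)   # AND: FALSE
(7 > 2) | (7 < 5)   # OR: TRUE
!(7 > 2)            # NOT: FALSE
```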
Exercise
Try to define a function that takes the following vector of numbers as input and returns a vector in which each element has been divided by 2 and raised to the 2nd power:
Now check that it produces the desired result:
Solution
Now check that it produces the desired result:
## [1] 0.25 1.00 2.25 4.00 6.25
Often you need to perform one operation or another based on a given condition. This is known as conditional execution. In R, you can implement it as follows:
The expression or object that replaces someCondition
must evaluate to either TRUE or FALSE.
Here is an example:
## [1] "hot"
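A minimal sketch consistent with the output above, assuming a hypothetical `temperature` variable and an arbitrary threshold of 25:

```r
temperature <- 30   # hypothetical value, in degrees Celsius

# The condition must evaluate to a single TRUE or FALSE
if (temperature > 25) {
  weather <- "hot"
} else {
  weather <- "cold"
}
weather
```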
Conditional expressions can also be chained:
if (someCondition) {
  executeThis()
} else if (someOtherCondition) {
  executeThat()
} else {
  executeSomethingElse()
}
or nested:
Exercise
Write a function that computes the square root of a number if the number is strictly positive and the square root of its absolute value if it is negative:
computeSquare <- function(number) {
  if (someCondition) { # change the condition
    # write your code here
  } else {
    # write your code here
  }
}
Now check that it produces the desired result:
Solution
computeSquare <- function(number) {
  if (number > 0) {
    return(sqrt(number))
  } else {
    return(sqrt(abs(number)))
  }
}
## [1] 2
## [1] 4
Sometimes you need to repeat the same operation more than once with some variation. Here is an example:
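A minimal sketch of such a repetition is a countdown written with `while`. The function name `countdown` and the starting value 3 are assumptions, chosen to be consistent with the printed output that follows.

```r
countdown <- function(n) {
  # Keep looping as long as n is strictly positive
  while (n > 0) {
    print(n)
    n <- n - 1   # without this update the loop would never end
  }
  print("Blastoff!")
}

countdown(3)
```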
## [1] 3
## [1] 2
## [1] 1
## [1] "Blastoff!"
The while statement checks whether a condition is TRUE (here n>0) and keeps executing the code in curly brackets until (if ever) it turns FALSE. Make sure that you are not setting up an infinite loop!
A second alternative, when you know exactly how many iterations you need to run, is the for statement. Here is how we would rewrite the countdown function with a for loop.
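A minimal sketch, assuming a hypothetical function name `countdownFor` and a starting value of at least 1 (for `n = 0` the sequence `n:1` would count upwards instead):

```r
countdownFor <- function(n) {
  # n:1 generates the sequence n, n - 1, ..., 1
  for (i in n:1) {
    print(i)
  }
  print("Blastoff!")
}

countdownFor(3)
```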
## [1] 3
## [1] 2
## [1] 1
## [1] "Blastoff!"
tidyverse
What we explored so far is part of the core R language. Indeed, we did not import any package.
A package is a collection of functions pulled together by someone in order to perform a given set of operations. In R, packages are installed and loaded with the commands:
Once a package is loaded, you can use its functions as if they were R
core functions. If you want help with a function, you can type
?functionName in your console to access its
documentation.
There are many specialized packages in R: to build visualizations, to work with geographical data, to perform text analysis, etc. Among R packages, the tidyverse collection stands out as a coherent framework to perform all the most common data science operations, from reading the data, to cleaning it, to visualizing it.
Today, I’ll be illustrating the following packages:
readr: read and write data
dplyr and tidyr: data wrangling
ggplot2: visualization
rmarkdown and knitr: build smart documents
readr Reading and Writing Data
.csv is by far the most common format for rectangular data. You can read .csv files into R as follows:
You can select which columns to read with the col_select option and specify column types with the col_types option. To read a file with a non-standard delimiter, use read_delim and its delim option.
read_csv is highly customizable and offers many useful
options to take care of parsing problems. readr also offers
the read_fwf function to read fixed width files (relatively
common).
Once you have transformed the data you read and are satisfied with the result, you can write it to a file as follows:
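A minimal round-trip sketch with readr. The data frame and the temporary file paths are hypothetical, created only so that the example is self-contained; `col_select` assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical data, written to a temporary file
cases <- data.frame(country = c("Brazil", "China"),
                    year = c(1999L, 1999L),
                    cases = c(37737L, 212258L))
csvPath <- tempfile(fileext = ".csv")
write_csv(cases, csvPath)

# Read back only two of the three columns, with explicit column types
casesRead <- read_csv(csvPath,
                      col_select = c(country, cases),
                      col_types = cols(country = col_character(),
                                       cases = col_integer()))

# A semicolon-delimited file goes through write_delim and read_delim
delimPath <- tempfile(fileext = ".txt")
write_delim(cases, delimPath, delim = ";")
casesDelim <- read_delim(delimPath, delim = ";", show_col_types = FALSE)
```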
tidyr and dplyr Transforming Data
Often, the data available to you is in a different format from the one you need. The tidyverse generally expects a dataset to be tidy. There are three interrelated rules which make a dataset tidy:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Consider the following table:
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
This table violates rule 1 because two variables, cases and population, share the same column.
We can easily convert the table to a tidy format with the following code:
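A minimal sketch of that conversion, assuming the table above is the `table2` dataset that ships with tidyr and using `pivot_wider` (the name `tidyTable` is hypothetical):

```r
library(tidyr)

# Each distinct value of `type` becomes its own column, filled from `count`
tidyTable <- table2 %>%
  pivot_wider(names_from = type, values_from = count)
tidyTable
```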
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Imagine now that the data came as two separate tables:
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
## # A tibble: 3 × 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
Even in this case, we can easily create a tidy table. First, we collapse the different years of each table into a single variable, and then we join the two results:
table4aTidy <- table4a %>%
  pivot_longer(`1999`:`2000`,
               names_to = 'year', values_to = 'cases')
table4b %>%
  pivot_longer(`1999`:`2000`,
               names_to = 'year', values_to = 'population') %>%
  left_join(table4aTidy, by = c('country', 'year'))
## # A tibble: 6 × 4
## country year population cases
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 19987071 745
## 2 Afghanistan 2000 20595360 2666
## 3 Brazil 1999 172006362 37737
## 4 Brazil 2000 174504898 80488
## 5 China 1999 1272915272 212258
## 6 China 2000 1280428583 213766
Other useful functions to transform datasets are:
separate: to pull apart one column into multiple columns, by splitting wherever a separator character appears.
unite: to combine multiple columns into a single column.
complete: takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in where necessary.
fill: takes a set of columns where you want missing values to be replaced by the most recent non-missing value.
Exercise
Consider this table:
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Separate the rate variable into a cases and
a population variable. Recall that you can look up the
documentation for a function by typing ?myFunction.
Exercise (Continued)
And check if it worked:
Solution
And check if it worked:
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
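One possible solution, sketched under the assumption that the table in the exercise is the `table3` dataset shipped with tidyr (the name `table3Separated` is hypothetical):

```r
library(tidyr)

# Split `rate` at the "/" character into two new columns
table3Separated <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
table3Separated
```

The new columns are character vectors, as in the output above. Adding `convert = TRUE` to the call asks `separate` to convert them to numbers.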
Let us pause for a moment. In the code on the previous slides I’ve
been frequently using the %>% operator. How does this
work?
This pipe operator is implemented by the magrittr
package and imported with the tidyverse. It passes the
output of one operator or function to the next one as its first
argument. Most tidyverse functions work very naturally with
the pipe but other functions work as well with the following syntax:
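A small sketch of both uses of the magrittr pipe: the default behaviour, and the `.` placeholder that sends the left-hand side to an argument other than the first. The values are arbitrary.

```r
library(magrittr)

# By default the left-hand side becomes the first argument
roots <- c(1, 4, 9) %>% sqrt()          # same as sqrt(c(1, 4, 9))
roots                                   # 1 2 3

# The dot placeholder sends it to another argument instead
upTo4 <- 4 %>% seq(from = 1, to = .)    # same as seq(from = 1, to = 4)
upTo4                                   # 1 2 3 4
```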
In one of the previous examples, I used the left_join
function to combine two datasets through a common key. This is
a very common operation and is implemented by the *_join
family of functions.
inner_join: keeps only entries found in both datasets
left_join: keeps all entries found in the left dataset
right_join: keeps all entries found in the right dataset
full_join: keeps any entry found in either of the two datasets
You can specify which keys should be used for the join operation with the by parameter.
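The difference between the joins can be sketched on two hypothetical toy tables that share the key column `country`:

```r
library(dplyr)

# Hypothetical toy tables: Brazil is the only country present in both
casesTbl <- data.frame(country = c("Afghanistan", "Brazil"),
                       cases = c(745, 37737))
popTbl <- data.frame(country = c("Brazil", "China"),
                     population = c(172006362, 1272915272))

inner <- inner_join(casesTbl, popTbl, by = "country")  # Brazil only
left  <- left_join(casesTbl, popTbl, by = "country")   # Afghanistan (population NA), Brazil
right <- right_join(casesTbl, popTbl, by = "country")  # Brazil, China (cases NA)
full  <- full_join(casesTbl, popTbl, by = "country")   # all three countries
```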
Here is a graphic representation of what each *_join
function does.
You’ll often want to add, remove, reorder, or modify variables in a
dataset. This is the job of dplyr.
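Add or modify variables with the mutate function (transmute removes existing variables).
Keep or drop variables with the select function.
Keep or drop rows with the filter function.
Move variables to a different position with the relocate function.
Sort rows with the arrange function.
Here is an example: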
table1 %>%
  mutate(rate = cases/population) %>% # we add a new variable
  filter(population > 100000000) %>% # keep if pop > 100 million
  relocate(year) %>% # we move year to the first position
  select(country, year, rate) %>% # we keep only country, year, and rate
  arrange(year, country) # we sort by year and country
## # A tibble: 4 × 3
## country year rate
## <chr> <int> <dbl>
## 1 Brazil 1999 0.000219
## 2 China 1999 0.000167
## 3 Brazil 2000 0.000461
## 4 China 2000 0.000167
The syntax of select is used in many other functions throughout the tidyverse. It is thus worth exploring it a little bit more:
var1:var2: Selects all variables between var1 and var2
var1,var3,...,var5: Selects the individual variables var1, var3, …, var5
starts_with('someString'): Selects all variables starting with someString
ends_with('someString'): Selects all variables ending with someString
The - operator in front of any selection (i.e. select(-selection)) excludes the selected variables and keeps the remaining ones.
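These selection forms can be tried on the built-in `iris` data frame; this is only a sketch, and `names()` is added at the end of each pipeline to show which columns survive.

```r
library(dplyr)

iris %>% select(Sepal.Length:Petal.Length) %>% names()
# "Sepal.Length" "Sepal.Width" "Petal.Length"

iris %>% select(Species, Sepal.Width) %>% names()
# "Species" "Sepal.Width"

iris %>% select(starts_with("Sepal")) %>% names()
# "Sepal.Length" "Sepal.Width"

iris %>% select(ends_with("Width")) %>% names()
# "Sepal.Width" "Petal.Width"

iris %>% select(-Species) %>% names()
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
```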
Exercise
Starting from table1, keep only rows for Afghanistan, add a new variable called rate equal to the number of cases per 100,000 individuals, and remove the cases and population columns.
Solution
table1 %>%
  filter(country == 'Afghanistan') %>%
  mutate(rate = (cases/population)*100000) %>%
  select(-c(cases, population))
## # A tibble: 2 × 3
## country year rate
## <chr> <int> <dbl>
## 1 Afghanistan 1999 3.73
## 2 Afghanistan 2000 12.9
Within mutate we can perform some clever operations
taking advantage of dplyr functions.
## # A tibble: 6 × 5
## country year cases population highPop
## <chr> <int> <int> <int> <lgl>
## 1 Afghanistan 1999 745 19987071 FALSE
## 2 Afghanistan 2000 2666 20595360 FALSE
## 3 Brazil 1999 37737 172006362 TRUE
## 4 Brazil 2000 80488 174504898 TRUE
## 5 China 1999 212258 1272915272 TRUE
## 6 China 2000 213766 1280428583 TRUE
The if_else function selects the first or the second input based on a condition. Here we create the highPop variable, which equals FALSE if the population is smaller than 100 million and TRUE otherwise.
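A minimal sketch of how the `highPop` variable could be built with `if_else`, assuming `table1` is the dataset shipped with tidyr (the name `withHighPop` is hypothetical):

```r
library(dplyr)
library(tidyr)   # provides the table1 dataset

# if_else(condition, value if TRUE, value if FALSE)
withHighPop <- table1 %>%
  mutate(highPop = if_else(population > 100000000, TRUE, FALSE))
withHighPop
```

Because the condition is itself logical, `mutate(highPop = population > 100000000)` would give the same column. `if_else` becomes more useful when the two outcomes are other values, such as labels.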
When you have more than two cases, you should replace
if_else with case_when instead of using nested
clauses.
table1 %>%
  mutate(popSize = case_when(population > 1000000000 ~ 'Very Large',
                             population > 100000000 ~ 'Large',
                             TRUE ~ 'Medium or Small'))
## # A tibble: 6 × 5
## country year cases population popSize
## <chr> <int> <int> <int> <chr>
## 1 Afghanistan 1999 745 19987071 Medium or Small
## 2 Afghanistan 2000 2666 20595360 Medium or Small
## 3 Brazil 1999 37737 172006362 Large
## 4 Brazil 2000 80488 174504898 Large
## 5 China 1999 212258 1272915272 Very Large
## 6 China 2000 213766 1280428583 Very Large
case_when works like a chain of if / else if statements. Conditions are evaluated sequentially, so you should move from the most specific to the most general condition. If none of the cases is matched, an NA value is returned. You can, however, use TRUE as the last condition to catch all the remaining cases.
If you need to perform the same operation on multiple variables, you
can use the across function:
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <dbl> <dbl>
## 1 Afghanistan 1999 0.745 19987.
## 2 Afghanistan 2000 2.67 20595.
## 3 Brazil 1999 37.7 172006.
## 4 Brazil 2000 80.5 174505.
## 5 China 1999 212. 1272915.
## 6 China 2000 214. 1280429.
This may seem redundant here, but it becomes very useful when you have many variables.
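A minimal sketch of such a call, consistent with the output above: both `cases` and `population` are divided by 1,000 in a single `mutate`. It assumes `table1` comes from tidyr, and the name `inThousands` is hypothetical.

```r
library(dplyr)
library(tidyr)   # provides the table1 dataset

# across(columns, function): apply the same function to every selected column
inThousands <- table1 %>%
  mutate(across(c(cases, population), ~ .x / 1000))
inThousands
```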
Often you’ll need to work with groups within the data. For example,
you might want to compute the maximum number of cases by country. Within
dplyr, you can do that with the group_by
function.
table1 %>%
  group_by(country) %>%
  mutate(maxCases = max(cases)) %>%
  ungroup() # we remove the grouping
## # A tibble: 6 × 5
## country year cases population maxCases
## <chr> <int> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071 2666
## 2 Afghanistan 2000 2666 20595360 2666
## 3 Brazil 1999 37737 172006362 80488
## 4 Brazil 2000 80488 174504898 80488
## 5 China 1999 212258 1272915272 213766
## 6 China 2000 213766 1280428583 213766
If you want to compute a summary of the data by group and then
preserve only one observation for each group, you can replace the
mutate function with summarize.
table1 %>%
  group_by(country) %>%
  summarize(maxCases = max(cases)) %>%
  ungroup() # we remove the grouping
## # A tibble: 3 × 2
## country maxCases
## <chr> <int>
## 1 Afghanistan 2666
## 2 Brazil 80488
## 3 China 213766
Notice that summarize removes all the other variables.
Exercise
Compute the average population by year and then filter the dataset so that it only contains observations from the year with the highest average population. Then remove the intermediate variable you created to store the average population.
Solution
table1 %>%
  group_by(year) %>%
  mutate(averagePop = mean(population)) %>%
  ungroup() %>%
  filter(averagePop == max(averagePop)) %>%
  select(-averagePop)
## # A tibble: 3 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 2000 2666 20595360
## 2 Brazil 2000 80488 174504898
## 3 China 2000 213766 1280428583
ggplot2
ggplot2 is a general-purpose library for visualizing
data. The key principle of ggplot2 is the idea of building
up a plot by combining different layers, each responsible for a specific
function. You can learn more about ggplot2 philosophy here but
it’s probably best to start with an example. We will be using the
following classic dataset:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We wish to plot sepal length against sepal width.
ggplot(data = iris) + # Notice that we use + and not the pipe %>%
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width))
We specify that the data we want to use is in the iris object. Then we add our first layer geom_point and tell ggplot2 that Sepal.Length should be mapped to the x-axis and Sepal.Width should be mapped to the y-axis.
Congratulations! This is your first graph. Suppose we want to change point shape based on the species.
ggplot(data = iris) + # Notice that we use + and not the pipe %>%
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width,
                           shape = Species))
We just need to map the Species variable to the shape aesthetic.
If we felt that shape alone is not enough to help the viewer separate points by species, we could add color:
plot <- ggplot(data = iris,
               mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(mapping = aes(shape = Species, color = Species))
plot
Suppose we want to focus on a specific portion of the graph:
coord_cartesian controls the coordinates and can zoom on
a part of the plot. Notice that we have stored our previous plot into
the plot object and we are now adding more layers to it
with the + operator.
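A minimal sketch of such a zoom. The axis limits are arbitrary values chosen for illustration, and the plot is rebuilt first so that the example is self-contained.

```r
library(ggplot2)

plot <- ggplot(data = iris,
               mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(mapping = aes(shape = Species, color = Species))

# Zoom in on part of the plot without dropping any observation
zoomedPlot <- plot + coord_cartesian(xlim = c(5, 7), ylim = c(2.5, 3.5))
zoomedPlot
```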
We can change the axis labels and add a title:
We could also add a regression line:
Maybe it would be better to have each species in a separate graph:
facet_grid lets you specify which variables should be used in the rows and which in the columns.
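The three additions described above (labels and title, a regression line, and one panel per species) can be sketched as separate layers added to a scatter plot. The label text and the object names are hypothetical.

```r
library(ggplot2)

basePlot <- ggplot(data = iris,
                   mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(mapping = aes(color = Species))

fullPlot <- basePlot +
  # axis labels and title
  labs(x = "Sepal Length", y = "Sepal Width",
       title = "Sepal dimensions in the iris dataset") +
  # linear regression line with its confidence band
  geom_smooth(method = "lm", formula = y ~ x) +
  # one panel per species, laid out in columns
  facet_grid(cols = vars(Species))
fullPlot
```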
We are not confined to points. Here is a histogram of sepal length.
There are many different geometries you can use:
geom_line
geom_boxplot
geom_density
geom_map
geom_errorbar
geom_ribbon
and many more.
You can customize almost every aspect of a plot: axes, labels, grid
lines, legend, orientation, size, palettes. However,
ggplot2 also offers a set of themes that modify several
aspects of a figure at once:
and more are available through the ggthemes package.
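As a sketch, here are three of the complete themes that ship with ggplot2, applied to the same hypothetical scatter plot:

```r
library(ggplot2)

scatter <- ggplot(data = iris,
                  mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

bwPlot      <- scatter + theme_bw()        # white background, grey grid lines
minimalPlot <- scatter + theme_minimal()   # no background annotations
classicPlot <- scatter + theme_classic()   # axis lines, no grid
bwPlot
```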
You can combine different geometries and other elements to build very complex visualizations. Here are two examples from my own research.
Exercise
Using the iris dataset, create a new graph with a
different box plot of Sepal.Width for each Species. Label
the axes in an appropriate way. Use the theme you prefer.
Solution
ggplot(data = iris) +
  geom_boxplot(mapping = aes(x = Species, y = Sepal.Width)) +
  labs(y = 'Sepal Width') +
  theme_bw()
You might have noticed how this presentation combines formatted text, R code and plots. This presentation uses RMarkdown, a system integrated in RStudio that lets you write code through a notebook interface and create reproducible documents.
If you clone the repository where I uploaded this presentation’s material you’ll be able to recreate this document with a simple click of the “Knit” button in RStudio.
You have many different output types you can choose from: html, markdown, doc, pdf (through LaTeX), and others.
You can learn more about RMarkdown functionalities here.
You might be tempted to use for loops to repeat an operation on all elements of a set. However, this will likely prove to be slow. That’s because in R (and many other languages) the iterations of a for loop are executed sequentially even when they could be executed in parallel.
R makes it relatively easy to transform certain for loops into their vectorized equivalent (i.e. implemented in such a way that multiple iterations are run simultaneously). Suppose we want to compute the exponential of log of x*0.3 + 1 for all the elements of a vector. We could define a function to perform such an operation on a single number:
We could then apply this function to the elements of our vector with
a for loop as follows:
n <- 100
numbers <- seq(0, 2, length.out = n)
result <- vector(mode = "list", length = n)
for (i in 1:n) {
  result[[i]] <- arbitraryOperation(numbers[i])
}
result <- as.numeric(result)
Alternatively, we could write a more compact expression using the sapply functional. This function applies a given function to all the elements of a vector and returns another vector.
Because R is a smart language, many basic functions are automatically
vectorized when applied over vectors. For example, the exp,
the +, the *, and the log
operations can all be applied directly to an entire vector (and our
arbitraryOperation function inherits this property). Let us
now compare the performance of these three alternatives.
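The three alternatives can be sketched side by side. The body of `arbitraryOperation` is an assumption here: the operation is read as `exp(log(x * 0.3 + 1))`. The loop is also written with a pre-allocated numeric vector instead of a list.

```r
# Assumed form of the operation: exp(log(x * 0.3 + 1))
arbitraryOperation <- function(x) {
  exp(log(x * 0.3 + 1))
}

numbers <- seq(0, 2, length.out = 100)

# 1. for loop: one element at a time
loopResult <- numeric(length(numbers))
for (i in seq_along(numbers)) {
  loopResult[i] <- arbitraryOperation(numbers[i])
}

# 2. sapply: the function is applied to each element in turn
sapplyResult <- sapply(numbers, arbitraryOperation)

# 3. vectorized: the whole vector is passed in a single call
vectorResult <- arbitraryOperation(numbers)

# All three approaches return the same values
all.equal(loopResult, sapplyResult)
all.equal(sapplyResult, vectorResult)
```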
| Implementation | Median Execution Time | Memory Allocation |
|---|---|---|
| For Loop | 2.03ms | 18.82KB |
| Sapply | 76.59µs | 3.58KB |
| Vectorized | 3.39µs | 1.66KB |
With a vector of size 100, the sapply solution is about 27 times faster than the for loop and the vectorized version is about 23 times faster than the sapply solution.
In general, if a vectorized version is available you should use it.
As a general note, you might not always care about performance. If you are not writing computationally intensive code, readability should be given a higher priority.
At the same time, R is not a fast language so if performance is key to your project, you might have to look elsewhere.